Take-home_Ex01

Author

Wan Xinyu

1. The task

The purpose of this take home exercise is to to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods. This exercise requires a user-friendly and interactive solution that helps city managers and planners to explore the complex data in an engaging way and reveal hidden patterns.

2. Data preparation

2.1 Installing the data packages

pacman::p_load(ggplot2, ggiraph, plotly, 
               patchwork, DT, tidyverse,
               ggrepel, ggthemes, hrbrthemes,
               tidyverse,ggstatsplot,pals) 

The unzipped files have been saved into a new folder named data for better organization.

part_info <- read_csv("data/Participants.csv")

Participants.csv

Contains information about the residents of City of Engagement that have agreed to participate in this study.

  • participantId (integer): unique ID assigned to each participant.
  • householdSize (integer): the number of people in the participant’s household
  • haveKids (boolean): whether there are children living in the participant’s household.
  • age (integer): participant’s age in years at the start of the study.
  • educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
  • interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
  • joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.
finance <- read_csv("data/FinancialJournal.csv")

FinancialJournal.csv

Contains information about financial transactions.

  • participantId (integer): unique ID corresponding to the participant affected
  • timestamp (datetime): the time when the check-in was logged
  • category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
  • amount (double): the amount of the transaction For explanation of Rent Adjustment, please refer to this link

Lets first examine the properties of the participants csv file.

datatable(part_info)
str(part_info)
spc_tbl_ [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
 $ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
 $ haveKids      : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ age           : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
 $ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
 $ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
 $ joviality     : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   householdSize = col_double(),
  ..   haveKids = col_logical(),
  ..   age = col_double(),
  ..   educationLevel = col_character(),
  ..   interestGroup = col_character(),
  ..   joviality = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Only displaying the meaningful summary

summary(part_info[, c("householdSize", "haveKids", "age", "educationLevel", "interestGroup", "joviality")])
 householdSize    haveKids            age        educationLevel    
 Min.   :1.000   Mode :logical   Min.   :18.00   Length:1011       
 1st Qu.:1.000   FALSE:710       1st Qu.:29.00   Class :character  
 Median :2.000   TRUE :301       Median :39.00   Mode  :character  
 Mean   :1.964                   Mean   :39.07                     
 3rd Qu.:3.000                   3rd Qu.:50.00                     
 Max.   :3.000                   Max.   :60.00                     
 interestGroup        joviality       
 Length:1011        Min.   :0.000204  
 Class :character   1st Qu.:0.240074  
 Mode  :character   Median :0.477539  
                    Mean   :0.493794  
                    3rd Qu.:0.746819  
                    Max.   :0.999234  

Count of resident for each interest group

table(part_info$interestGroup)

  A   B   C   D   E   F   G   H   I   J 
102  91 102  96  83 106 108 111  96 116 

Count of resident for each education level

table(part_info$educationLevel)

          Bachelors            Graduate HighSchoolOrCollege                 Low 
                232                 170                 525                  84 
sum(is.na(part_info))
[1] 0

Next we will visualize some variables in document to see if there are any anomalies. From the chart below, we see that there exist no participants with extremely large age or low

Show the code
p1 <- ggplot(part_info, aes(x = age)) +
  geom_bar() +
  labs(title = "Distribution of Participants' Age",
       x = "Age",
       y = "No. of person")

fig <- ggplotly(p1)

fig <- fig %>% 
  layout(xaxis = list(title = 'Age'), 
         yaxis = list(title = 'No. of person'))


p2 <- ggplot(part_info, aes(x = householdSize)) +
  geom_bar() +
  labs(title = "Distribution of house hold size",
       x = "Household size",
       y = "Count")

fig2 <- ggplotly(p2)

fig2 <- fig2 %>% 
  layout(xaxis = list(title = 'Household size'), 
         yaxis = list(title = 'Count'))



fig3 <- subplot(fig, fig2, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.1 ) %>%
  layout(title = 'Outlier checking',
         plot_bgcolor='#e5ecf6', 
         xaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff'), 
         yaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff')) %>%
  
  layout(annotations = list(
      list(
        x = 0.2,  
        y = 1.0,  
        text = "Distribution of Participants' Age",  
        xref = "paper",  
        yref = "paper",  
        xanchor = "center",  
        yanchor = "bottom",  
        showarrow = FALSE 
      ),
      list(
        x = 0.8,  
        y = 1.0,  
        text = "Distribution of house hold size",  
        xref = "paper",  
        yref = "paper",  
        xanchor = "center",  
        yanchor = "bottom",  
        showarrow = FALSE 
      )
    ))

fig3
duplicates1 <- duplicated(part_info)
part_info[duplicates1, ]
# A tibble: 0 × 7
# ℹ 7 variables: participantId <dbl>, householdSize <dbl>, haveKids <lgl>,
#   age <dbl>, educationLevel <chr>, interestGroup <chr>, joviality <dbl>

Since 0 rows are displayed. We can confirm that duplicates do not exist in this csv file.

Lets move on to examine the properties of the finance csv file.

Sneak peak of the first few entries in the dataset

head(finance)
# A tibble: 6 × 4
  participantId timestamp           category  amount
          <dbl> <dttm>              <chr>      <dbl>
1             0 2022-03-01 00:00:00 Wage      2473. 
2             0 2022-03-01 00:00:00 Shelter   -555. 
3             0 2022-03-01 00:00:00 Education  -38.0
4             1 2022-03-01 00:00:00 Wage      2047. 
5             1 2022-03-01 00:00:00 Shelter   -555. 
6             1 2022-03-01 00:00:00 Education  -38.0
Warning

Please do not use datatable here or you will face a error of Fatal javascript OOM in reached Heap Limit which indicate that R studio session has run out of memory while attempting to execute JavaScript code.

str(finance)
spc_tbl_ [1,513,636 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId: num [1:1513636] 0 0 0 1 1 1 2 2 2 3 ...
 $ timestamp    : POSIXct[1:1513636], format: "2022-03-01" "2022-03-01" ...
 $ category     : chr [1:1513636] "Wage" "Shelter" "Education" "Wage" ...
 $ amount       : num [1:1513636] 2473 -555 -38 2047 -555 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   timestamp = col_datetime(format = ""),
  ..   category = col_character(),
  ..   amount = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Only displaying the meaningful summary

summary(finance[c("amount")])
     amount         
 Min.   :-1562.726  
 1st Qu.:   -5.594  
 Median :   -4.000  
 Mean   :   20.047  
 3rd Qu.:   21.598  
 Max.   : 4096.526  
table(finance$category)

     Education           Food     Recreation RentAdjustment        Shelter 
          3319         790051         296013            131          11463 
          Wage 
        412659 
sum(is.na(finance))
[1] 0

Check for outlier in the amount variable. We first group the amount variables by the category. Then we do a box plot. From the chart we can observe that shelter has some abnormally small values to the negative end and wages has some exceptionally large values on the positive end. We may wish to take note of these in our analysis.

Show the code
# Create a box plot of amount by category
ggplotly(ggplot(finance, aes(x = category, y = amount, fill = category)) +
  geom_boxplot() +
  xlab("Expense Category") +
  ylab("Amount") +
  ggtitle("Amount Spent by Expense Category"))

Now lets move to check for duplicates. The code below will display the duplicates in the financial_journal.csv

duplicates <- duplicated(finance)
finance[duplicates, ]
# A tibble: 1,113 × 4
   participantId timestamp           category   amount
           <dbl> <dttm>              <chr>       <dbl>
 1             0 2022-03-01 00:00:00 Shelter    -555. 
 2             0 2022-03-01 00:00:00 Education   -38.0
 3             1 2022-03-01 00:00:00 Shelter    -555. 
 4             1 2022-03-01 00:00:00 Education   -38.0
 5             2 2022-03-01 00:00:00 Shelter    -557. 
 6             2 2022-03-01 00:00:00 Education   -12.8
 7             3 2022-03-01 00:00:00 Shelter    -555. 
 8             3 2022-03-01 00:00:00 Education   -38.0
 9             4 2022-03-01 00:00:00 Shelter   -1556. 
10             4 2022-03-01 00:00:00 Education   -12.8
# ℹ 1,103 more rows

We will remove these duplicates and resave the file as finance using the following code

finance <- distinct(finance)

Lets just confirm that the duplicates are removed using the following code. The output of the following should be the same.

nrow(finance)
[1] 1512523
nrow(distinct(finance))
[1] 1512523
Warning

For better data presentation and consistency, we may encode all expenses into positive instead of negative using the code below. However, we will not do this as we will will analysis rent adjustments which has positive and negative and may cause confusion # Do NOT RUN

#finance$amount <- abs(finance$amount)

Checking for anomalies

Show the code
finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarise(num_months = n_distinct(format(date, "%Y-%m"))) %>%
  datatable()

Clicking from the above table, we do realize that most participant id has transaction in all 12 months but some only had transactions in one month. We should remove them as it is likely they are not residents

Show the code
finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarise(num_months = n_distinct(format(date, "%Y-%m"))) %>%
  filter(num_months != max(num_months)) %>%
  datatable()

We will move on to classify the above as non residents and remove them from the financial journal dataset.

# calculate num_months for each participant
monthly_counts <- finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarize(num_months = n_distinct(format(date, "%Y-%m")))

# find participants with num_months different from the maximum num_months
non_residents <- monthly_counts %>%
  filter(num_months != max(num_months))

# remove non-residents from the finance data frame
finance <- finance %>%
  filter(!participantId %in% non_residents$participantId)

A reduction in rows is observed below indicating removal of 131 residents.

nrow(distinct(finance))
[1] 1509897

We will also update out participant csv to remove the non residents

# remove non-residents from the finance data frame
part_info <- part_info %>%
  filter(!participantId %in% non_residents$participantId)

We can see a reduction in 131 rows as well

nrow(distinct(part_info))
[1] 880

3. Data visualization

Exploratory data analysis Lets now take alook at the demographic data provided in the dataset participants.csv

Show the code
# Histogram of age
v1<- ggplot(part_info, aes(x = age)) +
  geom_bar(binwidth = 5) +
  labs(title = "Distribution of Participant Age",
       x = "Age (years)",
       y = "Count")
ggplotly(v1)
Show the code
# Bar chart of education level
v2<- ggplot(part_info, aes(x = educationLevel)) +
  geom_bar() +
  labs(title = "Education Level of Participants",
       x = "Education Level",
       y = "Count")
ggplotly(v2)
Show the code
# Pie chart of household size
v3<- ggplot(part_info, aes(x = householdSize)) +
  geom_bar() +
  labs(title = "Household Size of Participants",
       x = "Household size",
       y = "Count")
ggplotly(v3)
Show the code
# Bar chart of whether participants have children
v4<- ggplot(part_info, aes(x = factor(haveKids))) +
  geom_bar() +
  labs(title = "Proportion of Participants with Children",
       x = "Have Children",
       y = "Count")
ggplotly(v4)
Show the code
v5 <- ggplot(data = part_info, aes(x = interestGroup)) +
      geom_bar(aes(text = paste("\n","Count: ", ..count.., "\n"," Percentage: ", scales::percent(..count../sum(..count..))))) +
      labs(title = "Interest Group Distribution", x = "Interest Group", y = "Count")

ggplotly(v5,tooltip = "text")
Show the code
v6<- ggplot(part_info, aes(x = joviality)) +
  geom_histogram(binwidth = 0.05, fill = "grey", color = "white") +
  labs(title = "Joviality Distribution", x = "Joviality", y = "Count")
ggplotly(v6)

3.1 Exploring participants data

Education vs age

Show the code
v7<- ggplot(part_info, aes(x = age, fill = educationLevel)) +
  geom_bar() +
  labs(title = "Relationship between Age and Education Level",
       x = "Age (years)",
       y = "Education Level")
v8<- ggplot(part_info, aes(x = age)) +
  geom_bar() +
  labs(title = "Age Distribution by Education Level",
       x = "Age (years)",
       y = "Count") +
  facet_wrap(~ educationLevel, ncol = 2)

v7 + v8

Note

Added fig.height to make sure that the charts are not overly compressed

We do not observe any distinct patterns and relationships between Age and Education for residents in town

Household size vs have kids

Show the code
ggplot(part_info, aes(x = haveKids, y = householdSize)) +
  geom_jitter() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Relationship between Household Size and Having Children",
       x = "Have Children",
       y = "Household Size")

From here we can observe a pattern that only household with 3 person have kids while those with 2 do not. Hence provison of subsidies such as milk and diaper vouchers should only be provisioned to family with more than 3 household members

Have kids vs education

Show the code
edu_levels <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

# convert educationLevel column to factor with desired levels
part_info$educationLevel <- factor(part_info$educationLevel, levels = edu_levels)

# Calculate percentage of each education level group with children
edu_kids <- part_info %>%
  group_by(educationLevel, haveKids) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count))

# Plot the bar chart

ggplotly(ggplot(edu_kids, aes(x = educationLevel, y = percentage, fill = haveKids)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Percentage of Participants with Children by Education Level",
       x = "Education Level",
       y = "Percentage with Children") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9"), labels = c("No", "Yes")))

From the above we can observe that observe that generally percentage of residents who have kids are generally lower that than percentage of residents who have kids. There seems to be an inverse relationship between education level and if a resident has kids (with exception to high school).

Mean age vs having kids

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean age:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = haveKids),
) +
  stat_summary(aes(y = age, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = age),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

From the chart we do not observe any significant difference.However we do note that mean age for having kids is around 38 years old which is pretty late.

ggbetweenstats(
  data = part_info,
  x = haveKids,
  y = age,
  plot.type = "box",
  pairwise.comparisons = TRUE,
  mean.plotting = TRUE,
   message = TRUE ,
  xlab = "Having Kids",
  ylab = "Age"
)

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean age of resident who have and do not have kids are the same

Percentage of Participants in Each Interest Group by Education Level

Show the code
edu_levels <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

# convert educationLevel column to factor with desired levels
part_info$educationLevel <- factor(part_info$educationLevel, levels = edu_levels)

grouped_data <- part_info %>%
  group_by(educationLevel, interestGroup) %>%
  summarise(count = n()) %>%
  mutate(percentage = prop.table(count) * 100)

# Create plot
plot <- ggplot(grouped_data, aes(x = educationLevel, y = percentage, fill = interestGroup)) +
  geom_col(position = "dodge") +
  labs(title = "Percentage of Participants in Each Interest Group by Education Level",
       x = "Education Level",
       y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +
  theme_economist_white()

# Convert to interactive plot
plotly_plot <- ggplotly(plot)
plotly_plot
Note

Define the desired order of the education levels using the edu_levels vector. Then, we convert the educationLevel column to a factor with the desired levels using the factor() function and the levels

There is no distinct patterns observed, activities seems to be quite well spread across residents of different education level

Mean joviality vs education

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean jovility:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = educationLevel),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

We can observe from the above chart that mean joviality increases with the education level.

ggbetweenstats(
  data = part_info,
  x = educationLevel,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Education Level",
  ylab = "Joviality"
)

At a significance level of 5%, since the p value is smaller than 0.05, we reject the null hypothesis that the mean joviality of resident with different education are the same.

Mean joviality vs household size

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean jovility:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = householdSize),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

We observe that mean joviality for single person household is higher than that of 2 and 3 person household and seems to display an decreasing trend with increasing number of members of the household.

ggbetweenstats(
  data = part_info,
  x = householdSize,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Household Size",
  ylab = "Joviality"
)

However, at a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of residents of different household size are the same.

Mean joviality vs have kids

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  havekids <- unique(part_info$haveKids)
  paste("Have Kids:", havekids, "<br>",
        "Mean Joviality:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = haveKids),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  ) +
  labs(title = "Relationship between Having Kids and Joviality",
       x = "Have Kids",
       y = "Joviality")

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

Mean joviality is lowest for residents who have kids (3 member household) and this may be due to additional stress of taking care of them.

ggbetweenstats(
  data = part_info,
  x = haveKids,
  y = joviality,
  plot.type = "box",
  pairwise.comparisons = TRUE,
  mean.plotting = TRUE,
  xlab = "Having Kids",
  ylab = "Joviality"
)

However, at a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident who have and do not have kids are the same.

Mean joviality vs age

Lets first plot a graph

# create ggplot object
p <- ggplot(part_info, aes(x = age, y = joviality)) + 
  geom_point(alpha = 0.5)

# convert ggplot object to plotly object
ggplotly(p, tooltip = c("age", "joviality"))

We see that this plot itself is not really meaningful. We will use mean joviality to represent each age group.

Show the code
# group data by age and calculate mean joviality
age_joviality <- part_info %>% 
  group_by(age) %>% 
  summarize(mean_joviality = mean(joviality))

# create ggplot plot
ggplotly(ggplot(age_joviality, aes(x = age, y = mean_joviality)) +
  geom_smooth(method = "lm", se = FALSE) + geom_point() +
  labs(x = "Age", y = "Mean Joviality") +
  ggtitle("Relationship between Age and Mean Joviality"))

From the plot we do not see a distinct trend with age and mean joviality. However,we can observe that resident of age 53 has the lowest mean joviality and resident of age 59 has the highest mean joviality

ggscatterstats(
  data = age_joviality,
  x = age,
  y = mean_joviality,
  marginal = FALSE,
  )+labs(x = "Age", y = "Mean Joviality")

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident is different across age group

Further examination
Show the code
# Add a column to indicate outliers using IQR method
is_outlier <- function(x) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x < (qnt[1] - H) | x > (qnt[2] + H)
}

# Create data subset for age 53
age53_joviality <- subset(part_info, age == 53)

# Create outlier column using is_outlier function
age53_joviality$outlier <- is_outlier(age53_joviality$joviality)

# Create ggplot box plot with outlier points colored red
plot53 <- ggplot(data = age53_joviality, aes(x = 1, y = joviality)) +
  geom_boxplot() +
  geom_point(aes(x = jitter(1, factor = 0.3), y = joviality, color = outlier)) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  theme_bw() +
  ggtitle("Joviality at Age 53") +
  xlab("") +
  ylab("Joviality")

# Convert ggplot object to plotly
plotly_object <- ggplotly(plot53)


# Create data subset for age 59
age59_joviality <- subset(part_info, age == 59)

# Create outlier column using is_outlier function
age59_joviality$outlier <- is_outlier(age59_joviality$joviality)

# Create ggplot box plot with outlier points colored red
plot59 <- ggplot(data = age59_joviality, aes(x = 1, y = joviality)) +
  geom_boxplot() +
  geom_point(aes(x = jitter(1, factor = 0.3), y = joviality, color = outlier)) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  theme_bw() +
  ggtitle("Joviality at Age 59") +
  xlab("") +
  ylab("Joviality")
# subplot for both plots
subplot(plot53, plot59, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.1 ) %>%
  layout(title = 'Further checking',
         plot_bgcolor='#e5ecf6', 
         xaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff'), 
         yaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff')) %>%
  layout(annotations = list(
    list(
      x = 0.25,  
      y = 1.0,  
      text = "Distribution of Age 53 Participants' Joviality",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    ),
    list(
      x = 0.75,  
      y = 1.0,  
      text = "Distribution of Age 59 Participants' Joviality",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    )
  ))

We can observe that joviality for age 53 residents are more concentrated to below 0.5 while joviality for age 59 residents are more evenly distributed across the axis. This confirms our expectation that age 53 residents may indeed be generally not jovial while for age 59 residents, there is no general concentration of joviality observed.

Mean Joviality vs Interest group

Show the code
# Aggregate joviality by interest group
joviality_interest <- aggregate(joviality ~ interestGroup, data = part_info, mean)

# Plot mean joviality by interest group
ggplot(data = joviality_interest, aes(x = interestGroup, y = joviality)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(joviality, 2)), vjust = -0.5) +
  labs(x = "Interest Group", y = "Mean Joviality", 
       title = "Mean Joviality by Interest Group")

Interest group E provides the highest mean joviality. If officials are looking to improve joviality among residents, they can look into actions such as subsidizing group E membership, etc.

ggbetweenstats(
  data = part_info,
  x = interestGroup,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Interest Group",
  ylab = "Joviality")+
  scale_color_brewer(palette = "Set3")

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident who participate in different interest groups are the same.

Note

We use scale_color_brewer() as default palette only has 8 colors. We can define out own colors as well to be used in palette

3.2 Exploring financial data

Next we move on to explore variables in the financial data. Since every participant can have multiple entries. We will explore the data by grouping the entries according the participant’s Id and the category

Sum of residents expenditure by category

Show the code
# Aggregate financial data by participant
financial_data_agg <- finance %>%
  group_by(participantId,category) %>%
  summarize(total = sum(amount), .groups = "drop")

# Financial summary
expenses_summary <- financial_data_agg %>%
  group_by(category) %>%
  summarize(total = sum(total))

# Bar chart of expenses by category
expenses_plot <- ggplot(expenses_summary, aes(x = category, y = total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Expense Category", y = "Total Amount Spent", title = "Expenses by Category") +
  theme_minimal()
ggplotly(expenses_plot)

In the chart above we realize wage is counted as part of expenditure We will remove wage since it is not exactly an expense that we will be looking at.

Expenses by category (removing wage)

Show the code
# Aggregate financial data by participant
financial_data_agg <- finance %>%
   filter(category != "Wage") %>%
  group_by(participantId, category) %>%
  summarize(total = sum(amount), .groups = "drop")

# Financial summary
expenses_summary <- financial_data_agg %>%
  group_by(category) %>%
  summarize(total = sum(total))

# Bar chart of expenses by category
expenses_plot1 <- ggplot(expenses_summary, aes(x = category, y = total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Expense Category", y = "Total Amount Spent", title = "Expenses by Category") +
  theme_minimal()
ggplotly(expenses_plot1)

From the chart above we realize that shelter accounts for the main category for expenditure. This is followed by recreation, food and education. Rent adjustment is on the negative end which may be a indication that the landlord in the city has overall lowered their rents.

Next we will examine the wages over the months across the time stamp period.

Show the code
# Convert the timestamp column to a datetime object
finance$timestamp <- as.POSIXct(finance$timestamp)

# Create a line chart of wages per month
wages_per_month <- finance %>%
  filter(category == "Wage") %>%
  group_by(month = floor_date(timestamp, "month")) %>%
  summarize(total_wages = sum(amount)) %>%
  ggplot(aes(x = month, y = total_wages)) +
  geom_line() +
  labs(x = "Month", y = "Total Wages", title = "Total Wages per Month") +
  scale_x_datetime(date_labels = "%b %Y")

# Display the plot
print(wages_per_month)

From the above chart we do realize that in March, total wage is abnormally high. This may be an anomaly that that may require further information to examine. For now, we will retain it as it is.

Average Amount by Category over Time

Show the code
# Aggregate financial data by category and timestamp
financial_data_agg <- finance %>%
  filter(category != "Wage") %>%
  group_by(category, timestamp) %>%
  summarize(avg_amount = mean(amount))

# Line chart of average amount by timestamp, colored by category
line_chart <- ggplot(financial_data_agg, aes(x = timestamp, y = avg_amount, color = category)) +
  geom_line() +
  labs(x = "Timestamp", y = "Average Amount", title = "Average Amount by Category over Time") +
  theme_minimal()

# Display the chart
ggplotly(line_chart)

From the above chart, we observe that mapping average amount for category by days in the time stamp is not visually pleasing to see any patterns. But still if we zoom in, we can get some info from it. That is we can see that there are some major fluctuation in shelter amount and rent adjustment in march and April. Education amount is incurred on first day of the month and that if we zoom in. We can see that recreation and food demonstrates a regular pattern as shown in the pic below

Lets try to map above information in terms of months

Average Amount Spent by Category per Month

Show the code
# Extract month from timestamp
financial_data_agg <- finance %>%
  filter(category != "Wage") %>%
  mutate(month = format(timestamp, "%Y-%m")) %>%
  group_by(participantId, category, month) %>%
  summarize(total = sum(amount), .groups = "drop")

# Calculate average amount spent by category per month
category_month_avg <- financial_data_agg %>%
  group_by(category, month) %>%
  summarize(avg_amount = mean(total))

# Create bar chart
category_month_plot <- ggplot(category_month_avg, aes(x = month, y = avg_amount, fill = category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Month", y = "Average Amount Spent", title = "Average Amount Spent by Category per Month") +
  theme(axis.text.x = element_text(angle = 45,
                                   vjust = 0.5,
                                   hjust = 0.5))

ggplotly(category_month_plot)
Note

Using theme minimal will result in overlapping of x-axis. I use theme here so that angle of x-axis label can be adjusted 45 degrees

From the chart above we observe that the amount spent on average per category is indeed in the following order with education < food < recreation < shelter. We can also observe some anomalies in March and April. There is an exceeding large expenditure on shelter and recreation in march and in April the rent adjustment increased exceptionally as well. The expenditure in shelter may have being transferred to increase rent adjustment.

Note

Since not everyone in the city is a landlord / tenant. This chart only serve as a benchmark of average city resident.

3.3 Merged visualization

Now lets see some visualizations after we merge these 2 files together using the code below

Show the code
financial_journal <- finance %>%
  mutate(date = as.Date(timestamp))

# Aggregate financial data by participantId and category
agg_financial <- financial_journal %>%
  group_by(participantId, category) %>%
  summarize(total_spent = sum(amount), .groups = "drop")

# Merge demographic data with aggregated financial data
merged_data <- part_info %>%
  left_join(agg_financial, by = "participantId")

We can see the residents’ total expenditures in each category and their respective wage using the below code.

DT::datatable(merged_data[, c("participantId", "category", "total_spent")], class = 'compact')

Lets move on to see some EDAs

Wage agaist joviality

Show the code
# Subset merged_data to only include rows where category is "wage"
wage_data <- merged_data %>% filter(category == "Wage")

# Create scatterplot of wage vs. joviality
ggplotly(ggplot(wage_data, aes(x = total_spent, y = joviality)) +
  geom_smooth() +
  labs(title = "Wage vs. Joviality", x = "Wage", y = "Joviality")+ geom_point())

Using the smooth tend line, we can observe that joviality generally decrease with increasing wage. This may be because a higher wage may mean more responsibility at work and hence greater amount of stress which lead to lower joviality

ggscatterstats(
  data = wage_data,
  x = total_spent,
  y = joviality,
  marginal = FALSE,
  ) + labs(x = "Wage", y = "Joviality")

At 5% significance level, since the p value is smaller than 0.05, we reject the null hypothesis that there is no correlation between joviality and wage

Wage against education

Show the code
wage_data <- merged_data %>% filter(category == "Wage")

ggplot(wage_data, aes(x = educationLevel, y = total_spent)) +
  geom_point() +
  geom_boxplot(alpha = 0.5, width = 0.2, outlier.size = 1) +
  labs(x = "Education Level", y = "Wage")

From the chart above we can observe increasing education level of residents has increasing level of wage

Wage against interest group

Show the code
wage_data <- merged_data %>% filter(category == "Wage")

ggplot(wage_data, aes(x = interestGroup, y = total_spent)) +
  geom_point(alpha = 0.5, width = 0.2) +
  geom_boxplot(alpha = 0.5, width = 0.2, outlier.size = 1) +
  labs(x = "Interest Group", y = "Wage")

From the above chart, wage distribution across interest group seems to be well spread. But Interest group I seems to have attracted on one of the hugest wage resident in the city.

Lets check for the correlations

pacman::p_load(readxl, performance, parameters, see)
model <- lm(total_spent ~ interestGroup + educationLevel, data = wage_data)
model

Call:
lm(formula = total_spent ~ interestGroup + educationLevel, data = wage_data)

Coefficients:
                      (Intercept)                     interestGroupB  
                         35276.37                           -2849.09  
                   interestGroupC                     interestGroupD  
                         -2936.16                             583.35  
                   interestGroupE                     interestGroupF  
                         -3019.36                           -5420.24  
                   interestGroupG                     interestGroupH  
                         -4753.63                            -993.26  
                   interestGroupI                     interestGroupJ  
                            58.03                           -2917.00  
educationLevelHighSchoolOrCollege            educationLevelBachelors  
                          5816.44                           28023.60  
           educationLevelGraduate  
                         42627.88  
check_collinearity(model)
# Check for Multicollinearity

Low Correlation

           Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
  interestGroup 1.04 [1.01, 1.22]         1.02      0.96     [0.82, 0.99]
 educationLevel 1.04 [1.01, 1.22]         1.02      0.96     [0.82, 0.99]
Show the code
check_c <- check_collinearity(model)
plot(check_c)

Percentage of spending within each category for each group

Show the code
spending_by_kids <- merged_data %>%
  filter(category != "Wage") %>%
  group_by(haveKids, category) %>%
  summarise(total_spending = sum(total_spent)) %>%
  mutate(percentage = total_spending/sum(total_spending)*100)

# Convert haveKids to a factor for easier plotting
spending_by_kids$haveKids <- as.factor(spending_by_kids$haveKids)

# Plot using ggplot2 and ggplotly
spending_by_kids_plot <- ggplot(spending_by_kids, aes(x = haveKids, y = percentage, fill = category)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_viridis_d() +
  theme_minimal() +
  labs(title = "Percentage of Spending by Category and Household Type",
       x = "Household Type",
       y = "Percentage of Spending") +
  guides(fill = guide_legend(title = "Spending Category"))

ggplotly(spending_by_kids_plot)

We can observe that those who do not spend on education at all while household which do have kids has greater spending on shelter and food and lower on recreation.

Spending category by education level

Show the code
spending_data <- merged_data %>% filter(category != "Wage")

ggplot(spending_data, aes(x = category, y = total_spent, fill = category)) +
  geom_col(position = "dodge") +
  facet_wrap(~educationLevel, ncol = 2) +
  scale_fill_discrete(name = "Spending Category") +
  labs(x = "Spending Category", y = "Amount")

From the chart we can observe that, with the exception of rent adjustment, spending pattern by category across different education level is generally the same.

Recreation spending vs joviality

recreation_data <- merged_data %>%
  filter(category == "Recreation")

# Create scatter plot of joviality vs recreation spending
ggplot(recreation_data, aes(x = total_spent, y = joviality)) +
  geom_smooth() +
  labs(x = "Recreation Spending", y = "Joviality") + geom_point()

From the observation it seems that increasing spending would increase joviality until a saturation is reached at around $10000. Note that the chart is plotted on a negative x-axis as spending is accounted for as negative value.

ggscatterstats(
  data = recreation_data,
  x = total_spent,
  y = joviality,
  marginal = FALSE,
  ) + labs(x = "Recreation", y = "Joviality")

At 5% significance level, since the p value is smaller than 0.05, we reject the null hypothesis that there is no correlation between joviality and recreation expenses

Conclusion

In conclusion, I believe that the authority there are a few points in which the authorities can look at when allocating the grant.

  • Greater funds should be channeled to support the education of residents given that it is the lowest expenditure. Currently education is only observed in residents who has kids and I assume that this expenditure is on the kids only. Perhaps the authorities can look at sponsoring adults continuous learning as well.

  • Mean joviality increases with increase increasing education and that this may be a reason to further support fundings for education purposes.

  • Joviality has an inverse relationship with wage and the manpower authority should perhaps examine the current laws and manpower landscape to determine if employees are properly treated in workplace

  • The mean age of residents having kids is 38 and the percentage of residents who have kids decreases as their education level increase. Targeted family development measures should be put in place if authorities seeks to reduce this age and encourage fertility rate among certain education group level

  • Shelter accounts for the largest expenditure in the category and residents who have kids spends on a larger percentage for shelter. Since shelter is a fundamental necessity, the authorities can perhaps consider rental subsidies and support for qualified residents.

  • Increasing recreation expenses improves joviality to a certain extent. If the city want to improve the happiness of the residents, perhaps they can consider measures such as free tours to parks,etc

To sum up, I have examined some variables and their correlation to each another. I have also provided some areas in which the authority can look at to disburse their city renewal grant and revitalize the community. However, as correlation does not equal causation, the above only serves as a reference to which the authority can look at. Further examination with more in dept data are required to understand the community and solve the problems more efficiently .